Setup

Objective
To showcase the minimum number of steps
required to do tertiary analysis of DNA + Protein
and some of the different ways to look at the data

Major questions answered:

  1. Do we see DNA clones?
  2. Do we see protein cell types
  3. Is the differential expression significant?
  4. Do the clones correlate with the cell types?

Things not shown:

  1. All available methods eg. Filtering of nearby variants, variant annotation, plots

  2. Discussing all methods and their options - Documented here

  3. Systemic variations seen in protein data - Next session

In [1]:
import missionbio.mextools.io as mio

h5path = './sample.h5'
sample = mio.load(h5path, raw=False)
Loading, ./sample.h5
Loaded in 0.3s

H5 files are a replacement of loom files.

Where to get them?

  1. These are part of the DNA and protein pipeline output

  2. The sample h5 used in this workflow can be found here

Note: This is a h5 file trimmed specifically for this analysis

Data structure

Dna, Cnv, and Protein are sub classes of the Assay class
The information is stored in four ways, and the user
can change each of those

1. metadata (add_metadata / del_metadata):
    dictionary containing metrics of the assay

2. row_attrs (add_row_attr / del_row_attr):
    dictionary which contains 'barcode' as one of
    the keys. All the values must be of the same
    length i.e. match the number of barcodes
    This is the attribute where 'label', 'pca',
    and 'umap' values are added

3. col_attrs (add_col_attr / del_col_attr):
    dictionary which contains 'ids' as one of
    the keys. All the values must be of the same
    length i.e. match the number ids
    'ids' contains variants for DNA assays
    and anitobides for Protein assays

4. layers (add_layer / del_layer):
    dictionary containing 'read_counts' as one of 
    the metrics. All the values have the shape
    (num barcodes) x (num ids). This is the attribute
    where 'normalized_counts' will be added

Sample holds the Dna and Protein information
In [3]:
sample.protein
Out[3]:
<missionbio.mextools.protein.Protein at 0x7fdabc756410>
In [4]:
sample.protein.metadata
Out[4]:
{'sample_name': array(['Compass028_Nextseq'], dtype=object),
 'n_reads': 128914059,
 'n_reads_trimmed': 128590556,
 'n_reads_valid_ab_barcodes': 117037906,
 'n_reads_valid_cell_barcodes': 121712026,
 'pipeline_version': '1.1.0'}
In [5]:
sample.protein.row_attrs
Out[5]:
{'barcode': array(['AACAACCTAAACTTGTCG', 'AACAACTGGTACGTTGGA', 'AACAATGCAAGACCACGC',
        ..., 'TTGTCAACCTACAACACC', 'TTGTCAACCTAGTAACGG',
        'TTGTTAGAGATCAGGATG'], dtype=object),
 'sample_name': array(['Compass028_Nextseq', 'Compass028_Nextseq', 'Compass028_Nextseq',
        ..., 'Compass028_Nextseq', 'Compass028_Nextseq',
        'Compass028_Nextseq'], dtype=object)}
In [6]:
sample.protein.ids()
Out[6]:
array(['CD110', 'CD117', 'CD123', 'CD135', 'CD19', 'CD24', 'CD3', 'CD33',
       'CD34', 'CD38', 'CD44', 'CD45', 'CD56', 'CD90', 'HLA-DR',
       'Mouse IgG1k'], dtype=object)
In [7]:
sample.dna.layers
Out[7]:
{'AF': array([[ 5.79710145,  0.        ,  0.72463768, ...,  1.68539326,
          1.68539326,  0.        ],
        [14.49275362,  1.2345679 ,  1.25      , ...,  2.02020202,
          0.        ,  1.01010101],
        [ 2.80373832,  0.        ,  0.92592593, ...,  2.45901639,
          0.        ,  2.45901639],
        ...,
        [ 1.86335404,  2.48447205,  0.625     , ...,  1.2195122 ,
          0.6097561 ,  0.        ],
        [ 2.87769784,  0.71942446,  2.14285714, ...,  3.33333333,
          0.        ,  0.        ],
        [ 8.33333333,  0.        ,  1.66666667, ...,  0.97087379,
         11.70212766,  0.        ]]),
 'DP': array([[138, 138, 138, ..., 178, 178, 178],
        [ 69,  81,  80, ...,  99,  99,  99],
        [107, 108, 108, ..., 122, 122, 122],
        ...,
        [161, 161, 160, ..., 164, 164, 164],
        [139, 139, 140, ...,  90,  90,  90],
        [ 60,  60,  60, ..., 103,  94, 103]], dtype=int16),
 'GQ': array([[99, 99, 99, ..., 99, 99, 99],
        [99, 99, 99, ..., 99, 99, 99],
        [99, 99, 99, ..., 99, 99, 99],
        ...,
        [99, 99, 99, ..., 99, 99, 99],
        [99, 99, 99, ..., 99, 99, 99],
        [25, 99, 99, ..., 99, 99, 99]], dtype=int8),
 'NGT': array([[0, 0, 0, ..., 0, 0, 0],
        [1, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 1, 0]], dtype=int8),
 'AF_NO_MISSING': array([[ 5.79710145,  0.        ,  0.72463768, ...,  1.68539326,
          1.68539326,  0.        ],
        [14.49275362,  1.2345679 ,  1.25      , ...,  2.02020202,
          0.        ,  1.01010101],
        [ 2.80373832,  0.        ,  0.92592593, ...,  2.45901639,
          0.        ,  2.45901639],
        ...,
        [ 1.86335404,  2.48447205,  0.625     , ...,  1.2195122 ,
          0.6097561 ,  0.        ],
        [ 2.87769784,  0.71942446,  2.14285714, ...,  3.33333333,
          0.        ,  0.        ],
        [ 8.33333333,  0.        ,  1.66666667, ...,  0.97087379,
         11.70212766,  0.        ]])}

DNA Analysis

Topcis covered

  1. Whitelist of variants
  2. Manually selecting variants

Basic filtering

Many filtering options are available
use the documentation shared earlier,
or the help() function to get the same
information here
In [8]:
help(sample.dna.filter_variants)
Help on method filter_variants in module missionbio.mextools.dna:

filter_variants(min_dp=10, min_gq=30, min_vaf=20, min_prct_cells=50, min_mut_prct_cells=1, min_std=0) method of missionbio.mextools.dna.Dna instance
    Remove uninformative variants
    
    Parameters
    ----------
    min_dp : int
        minimum depth (DP) for call to be considered.
        Variants with less than this DP in a given
        barcode are treated as no calls
    min_gq : int
        minimum genotype quality (GQ) for call to be
        considered. Variants with less than this GQ
        in a given barcode are treated as no calls
    min_vaf : float [0, 100]
        if the vaf of a given vairant for a given barcode
        is less than the given value, and the call is
        either '1' or '2', then it is converted to a
        no call
    min_prct_cells : float [0, 100]
        minimum percent of total cells in which the variant
        should be called as '0', '1' or '2' afterthe
        filters are applied
    min_mut_prct_cells : float [0, 100]
        minimum percent of the total cells in which the
        variant should be mutated i.e. either '1' or '2'
        after the filters are applied
    min_std : float [0, 1]
        standard deviation of the vaf across cells
        of the variants should be greater than
        this value
    
    Returns
    -------
    numpy.ndarray

In [9]:
# Filter variants
# This is the default insights filtering method

dna_vars = sample.dna.filter_variants()
dna_vars
Out[9]:
array(['chr2:25458546:C/T', 'chr2:25469502:C/T', 'chr2:25470426:C/T',
       'chr2:25470573:G/A', 'chr2:209113192:G/A', 'chr4:55599436:T/C',
       'chr4:106154990:TATAGATAG/T', 'chr4:106154990:T/TATAG',
       'chr4:106155133:A/G', 'chr4:106158216:G/A', 'chr4:106190862:T/C',
       'chr4:106194275:T/C', 'chr4:106197469:G/A', 'chr6:40116264:T/G',
       'chr6:62094287:A/T', 'chr7:148506064:A/G', 'chr7:148526923:A/G',
       'chr7:148529851:G/GA', 'chr7:148543525:A/G', 'chr10:5554293:T/C',
       'chr10:77210191:C/T', 'chr10:106721610:G/A', 'chr11:32414333:G/T',
       'chr11:32417945:T/C', 'chr11:32421542:G/A', 'chr12:112888239:C/T',
       'chr13:28597686:G/A', 'chr13:28610183:A/G', 'chr14:56969005:C/T',
       'chr16:8569820:G/A', 'chr17:7577427:G/A', 'chr17:7578115:T/C',
       'chr17:7578176:C/T', 'chr17:7578263:G/A', 'chr20:31023356:G/T',
       'chr21:36252917:C/T'], dtype='<U26')
In [10]:
# Check the number of filtered variants

len(dna_vars)
Out[10]:
36

Whitelist

Simply appnding the whitelist to the list of filtered
variants is sufficient to then select the variants
using the slice notation

i.e. sample.dna[{list of barcodes}, {list of ids}]
In [11]:
whitelist = ['chr1:115256513:G/A', 'chr21:44514718:C/T']
In [12]:
final_vars = whitelist + list(dna_vars)
In [13]:
len(final_vars)
Out[13]:
38
In [14]:
sample.dna.shape
Out[14]:
(2476, 936)
In [15]:
# Selecting all cells and final variants

sample.dna = sample.dna[sample.dna.barcodes(), final_vars]
In [16]:
# Check the shape i.e. (Number of barcodes, number of ids)
# of the final filtered dna object

sample.dna.shape
Out[16]:
(2476, 38)

Manual variant selection

Heatmaps are interactive. Clicking on it selects
the corresponding id whose value is stored in the
`selected_ids` attribute of the object

eg. sample.dna.selected_ids
In [17]:
sample.dna.stripplot(layer='AF', color_layer='GQ')
In [23]:
sample.dna.heatmap(layer='AF')
In [20]:
sample.dna.selected_ids
Out[20]:
array(['chr16:8569820:G/A', 'chr6:40116264:T/G', 'chr10:77210191:C/T',
       'chr7:148526923:A/G', 'chr4:106194275:T/C', 'chr4:106155133:A/G',
       'chr17:7578115:T/C', 'chr4:106154990:TATAGATAG/T'], dtype='<U32')
In [24]:
sample.dna = sample.dna.drop(sample.dna.selected_ids)

Clustering

DNA has a custom clustering method called `find_clones`

It projects the data on a UMAP and then performs
dbscan to identify unique clusters, which are then
merged in case they were formed due to missing
information
In [25]:
sample.dna.find_clones()
[0 1 2 ... 3 3 3]
Unique clusters found - 7
Clusters after removing missing data - 6
In [26]:
sample.dna.row_attrs
Out[26]:
{'barcode': array(['AACAACCTAAACTTGTCG', 'AACAACTGGTACGTTGGA', 'AACAATGCAAGACCACGC',
        ..., 'TTGTCAACCTACAACACC', 'TTGTCAACCTAGTAACGG',
        'TTGTTAGAGATCAGGATG'], dtype=object),
 'sample_name': array(['Compass028_Nextseq', 'Compass028_Nextseq', 'Compass028_Nextseq',
        ..., 'Compass028_Nextseq', 'Compass028_Nextseq',
        'Compass028_Nextseq'], dtype=object),
 'umap': array([[-3.0507529 ,  2.364983  ],
        [-3.393835  ,  5.8391247 ],
        [-5.647976  , -8.861678  ],
        ...,
        [ 6.620932  , -0.40666258],
        [ 6.542046  , -0.99099886],
        [ 5.227692  , -0.5960863 ]], dtype=float32),
 'label': array(['Clone D', 'Clone E', 'Clone C', ..., 'Clone F', 'Clone F',
        'Clone F'], dtype='<U7')}
In [30]:
sample.dna.scatterplot(attribute='umap', labels=True)
In [29]:
sample.dna.heatmap(layer='AF')

Conclusion

1. Basic filtering of barcodes ids demonstrated
2. Basic DNA filtering functionality showcased

CNV Analysis

Preliminary heatmap of CNV shows that there could be two clusters

Topics covered

  1. Down sampling options and their effects
In [31]:
sample.cnv.heatmap(layer='normalized_counts')

PCA options

Here the UMAP options are kept constant
The only parameter in PCA is the number of components

Here we see how to determine this value, and the effect
when we deviate from this value
In [32]:
sample.cnv.run_pca(PCA_components=6, layer='normalized_counts', show_plot=True)
sample.cnv.run_umap(attribute='pca', min_dist=0, n_neighbors=100)
In [ ]:
# Too many PCA components

sample.cnv.run_pca(PCA_components=15, layer='normalized_counts')
sample.cnv.run_umap(attribute='pca', min_dist=0, n_neighbors=100)
In [ ]:
# Too few PCA components

sample.cnv.run_pca(PCA_components=3, layer='normalized_counts')
sample.cnv.run_umap(attribute='pca', min_dist=0, n_neighbors=100)

Visualization

The result of the downsampling analysis is
visualized using a scatterplot of the umap
In [33]:
sample.cnv.cluster(attribute='umap', method='dbscan', eps=0.55)
[0 0 0 ... 1 1 1]
In [34]:
sample.cnv.scatterplot(attribute='umap', labels=True)

Conclusion

Given all other variables are kept constant

1. Too many PCA components may result in merging of clusters
2. Too few PCA component may result in splitting of clusters
3. The appropriate number of components can be determined using the elbow plot

Protein Analysis

Topics covered

  1. Basic workflow
  2. Custom clustering eg. selection on biaxial plot
  3. Custom methods by adding layers

Basic workflow

In [35]:
# Downsampling and clustering similar to CNV

sample.protein.normalize_reads('CLR')
sample.protein.run_pca(PCA_components=5, layer='normalized_counts')
sample.protein.run_umap(attribute='pca')

sample.protein.cluster(attribute='pca', method='graph-community')
Creating the Shared Nearest Neighbours graph
--------------------------------------------------
##################################################
Identifying clusters using louvain community detection

Number of clusters found: 10
Modularity: 0.752
In [36]:
sample.protein.heatmap(layer='normalized_counts')
In [37]:
sample.protein.scatterplot(attribute='umap', labels=True)
In [38]:
# Re cluster based on the observations from the UMAP

sample.protein.cluster(attribute='umap', method='dbscan')
[0 1 2 ... 3 3 3]
In [43]:
sample.protein.scatterplot(attribute='umap', labels=True)
In [40]:
# Prefered way to look at protein expression profiles
# In case of error, make sure that ids have been selected on the heatmap and shown in sample.protein.selected_ids

sample.protein.ridgeplot(layer='normalized_counts',
                         labels=True,
                         feature_ids=sample.protein.selected_ids)
In [42]:
# UMAP with the expression for each of the selected protein overlayed
# In case of error, make sure that ids have been selected on the heatmap and shown in sample.protein.selected_ids

sample.protein.scatterplot(attribute='umap',
                           layer='normalized_counts',
                           feature_ids=["CD34", 'CD44', 'HLA-DR'])

Custom clustering

When `labels=False` for any scatterplot
the lasso tool can be used to cluster cells
based on the selection made
In [44]:
# Selction on biaxial scatterplot
# The same can be done for UMAP when labels=False is passed

sample.protein.feature_scatter(layer='normalized_counts',
                               ids=['CD90', 'CD3'],
                               labels=False)
In [ ]:
# Set the labels based on the selection in the plot

sample.protein.set_selected_labels()

Custom methods by adding layers

If someone is interested in trying their methods,
they can simply modify the appropriate layers, attributes
and metadata to plugin their step in this workflow
In [45]:
# Custom normalization by changing the `normalized_counts` layer

import numpy as np

log_reads = np.log10(10 + sample.protein.layers['read_counts'])
norm = np.divide(log_reads, log_reads.mean(axis=1).reshape(-1, 1))


sample.protein.add_layer('normalized_counts', norm)
Other examples include:

custom labels -> 'label' row_attr
custom palette -> 'palette' metadata   

Conclusion

1. Protein analysis workflow similar to CNV
2. Different clustering methods can result in
   different types of clusters being identified
3. It is possible to have custom clustering for
   any scatterplot by using the lasso tool
4. Custom analysis is possible by modifying appropriate
   layers, attributes and metadata

Statistics

The significane of differential expression
based on a t-test can be looked at using
the `feature_signature` method
In [46]:
med, std, pval, tstat = sample.protein.feature_signature(layer='normalized_counts')
In [47]:
pval
Out[47]:
0 1 2 3
CD110 2.128354e-09 6.113514e-111 2.675321e-12 1.021417e-132
CD117 1.026774e-25 5.144984e-28 6.179198e-53 7.601850e-179
CD123 4.366837e-03 6.240211e-93 2.803658e-123 1.796235e-01
CD135 4.594457e-12 1.293979e-86 2.791242e-25 1.123744e-135
CD19 4.500825e-23 5.333422e-91 0.000000e+00 5.479898e-115
CD24 4.219498e-26 1.103560e-42 0.000000e+00 1.725082e-213
CD3 1.098585e-13 0.000000e+00 1.959952e-63 4.450175e-207
CD33 2.274106e-35 1.886303e-66 1.839666e-97 0.000000e+00
CD34 1.195997e-19 1.551442e-253 9.119256e-49 0.000000e+00
CD38 1.471882e-11 2.642007e-297 4.716956e-01 1.029034e-286
CD44 2.804687e-25 2.985716e-147 3.090508e-134 0.000000e+00
CD45 9.103913e-01 6.930793e-303 0.000000e+00 2.985604e-02
CD56 2.831984e-01 3.516644e-239 8.907106e-01 1.512360e-188
CD90 4.489075e-10 0.000000e+00 3.876517e-72 1.078192e-225
HLA-DR 9.932539e-10 0.000000e+00 3.764308e-297 6.879625e-04
Mouse IgG1k 7.622775e-11 5.926297e-124 8.100341e-15 3.508137e-152
In [48]:
pval = pval + 10 ** -50 + pval
pvals = -np.log10(pval) * (tstat > 0)
In [49]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 10))
sns.heatmap(pvals.T, vmax=50, vmin=2)
Out[49]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fda89119050>

Combined visualizations

Visualization for multiple assays at once
In [50]:
sample.clone_vs_analyte('protein')
In [51]:
sample.clone_vs_analyte('cnv')
In [52]:
# Filtering protein and cnv to improve the visualization

sample.protein = sample.protein[:, ['CD3', 'CD90']]
sample.cnv = sample.cnv[:, 58:85]
sample.clone_vs_analyte('protein')
In [53]:
# Certain clones can also be dropped, but they must be dropped from all assays
# Hence the sample object is sliced in this case
# In this case it is better to store the new sample in a separate variable

# This returns the dna barcodes with the given labels
select_bars = sample.dna.barcodes(['Clone D', 'Clone E', 'Clone F'])

sample_subset = sample[select_bars]
sample_subset.clone_vs_analyte('protein')
In [54]:
# The ids can also be reset to the entire set

sample.reset('cnv')
sample.reset('protein')
sample.clone_vs_analyte('protein')
In [64]:
sample.heatmap(clusterby='dna', sortby='protein', drop='cnv', flatten=False)

# Try the following
# sample.heatmap(clusterby='dna', sortby='protein', drop='cnv', flatten=True)
# sample.heatmap(clusterby='protein', sortby='dna', drop='cnv', flatten=False)
# sample.heatmap(clusterby='dna', sortby='protein', flatten=False)
Conclusion:

Data from h5 files can be effeciently manipulated,
visualized, and inferred using this multiomics
exploratory tools (mextools)